Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clean up bot regexes #7549

Merged
merged 7 commits into from
Feb 12, 2024
Merged

Conversation

biochimia
Copy link
Contributor

Description:

The changes in this pull request clean up regexes used to identify bots. Each change is performed in its own commit with a description of the change. Overall, the changes make the expressions stricter and more consistent, thus reducing the cognitive load for anyone perusing them.

In summary, these are the changes introduced:

  • Explicitly match dots when dots are expected, essentially using \. where an actual dot is expected, such as in URLs;
  • Harmonize version parsing to [\d.]+, whereas previously there were a few different ways bot versions were parsed essentially to the same effect;
  • Use non-capturing regexes, where the captured string is not relevant;
  • Drop wildcard matches at end of regexes, as they don't add value in identifying bots
  • Prefer having - at end of character classes.

Review

Bot version parsing in bot regexes is consistently done using `[\d.]+`.

In particular, the following changes are made:

- drop wrapper parentheses, where it is used, as the bot version is not
  exposed by the parser, and there is inconsistent use of grouping
  parentheses in the various regexes;
- `[\d+.]` is assumed to be a typo picked up from the use of `\d+.`
  outside character classes;
- `[\d+\.]` additionally uses a superfluous escape on the dot, which is
  not needed inside character classes;
- plain `\d` or `[0-9]` are replaced with the common expression, as they
  will match the same agent strings, but include more of the version
  string in the match–mainly, this drives consistency;

The main change that should arise here is that the plus character will
no longer be recognised as part of the bot/version match. The tests
don't suggest that this will be an issue. (If needed, the `+` can be
brought back, while keeping consistency)
Given that the groupings don't have a meaning, there's little point in
keeping them as part of the regexes. In addition, superfluous
parentheses are dropped.
The wildcard matches extend the match, but don't add value as the match
is already made at that point.
This avoids the need to escape the dash in character classes.
@mindreader
Copy link
Contributor

This is a really nice pr!

sanchezzzhak
sanchezzzhak previously approved these changes Feb 1, 2024
@sanchezzzhak
Copy link
Collaborator

@biochimia pls resolve conflict for PR

@liviuconcioiu
Copy link
Collaborator

@sanchezzzhak can I resolve the conflicts, so we can get this PR merged?

@sanchezzzhak
Copy link
Collaborator

yes, (you may need the rights of a Collaborator)

  • git fetch upstream pull/7549/head:pr_7549
  • next fork branch git checkout pr_7549
  • fetch update by remote git git fetch upstream master
  • merge remote to into pr_7549 git merge upstream/master
  • resolve conflict
  • git commit
  • git push
    done;

@sgiehl You can grant the same Collaborator rights for liviuconcioiu?

@biochimia
Copy link
Contributor Author

Sorry for the radio silence, as I haven't been able to devote any time to this.

I don't object to @liviuconcioiu or anyone else picking up the changes and addressing the conflicts as needed—if you have the opportunity. I assume that the changes will need to be (at least partially) redone, in order to address the same issues in any new or updated regexes.

If you don't beat me to it, I will try to find some time later this week to apply the changes on a fresh pull.

@sgiehl
Copy link
Member

sgiehl commented Feb 12, 2024

@sgiehl You can grant the same Collaborator rights for liviuconcioiu?

I'll send an invite.

@liviuconcioiu
Copy link
Collaborator

Thanks both for help!

@sanchezzzhak sanchezzzhak merged commit 10c995f into matomo-org:master Feb 12, 2024
15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants